🧠 How does Attention work?
Attention is a key component of the Transformer, a neural network architecture widely used in natural language processing (NLP). The Transformer uses attention to capture the relationships between different parts of the input sequence and the output sequence.
Attention works by assigning a weight to each element in the input sequence based on its relevance to the current output element. These weights are used to compute a weighted sum of the input sequence, which then serves as the input to the next layer of the model.
That may be too vague, so let’s break it down with an example.
Example:
Suppose we have an input sequence of length n and an output sequence of length m. We want to use the Transformer model to translate the input sequence into the output sequence.
- Let’s implement a `SelfAttention` module using PyTorch. First, we define three linear transformation matrices: query, key, and value.
```python
import torch
import torch.nn as nn

class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super(BertSelfAttention, self).__init__()
        self.all_head_size = config.hidden_size  # config.hidden_size 768, self.all_head_size 768
        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

    def forward(self, hidden_states):  # hidden_states dim (L, 768)
        Q = self.query(hidden_states)
        K = self.key(hidden_states)
        V = self.value(hidden_states)
```
Note that query, key, and value here are just names for operations (linear transformations). The actual Q/K/V are the outputs of these three.
- Suppose the input to these operations is the same matrix (for now, let’s not worry about why the input is the same). Let’s assume it’s a sentence of length L, where each token has a feature dimension of 768. The input would then be of shape (L, 768), with each row representing a word.
- Next, we have a self-attention operation:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
First, take the first row of Q, which represents the 768 features of the word “我” (meaning “I” in Chinese), and perform a dot product with the 768 features of “我” from the K matrix. The result is the value at position (0, 0) in the output matrix. This value is the raw attention score of the word “我” on itself in the sentence “我想吃酸菜鱼” (“I want to eat pickled fish”).
Naturally, the first row of the output matrix represents the attention weights of the word “我” on every word in the sentence “我想吃酸菜鱼.” Similarly, the entire result is a matrix that shows how much attention each word in the sentence gives to every other word (including itself), represented as a numerical value.
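The score computation described above can be sketched in a few lines. This is a toy example with random stand-in matrices (the sentence length 5 and the names `Q`, `K`, `scores` are illustrative, not from the original code):

```python
import torch

torch.manual_seed(0)
L, d = 5, 768          # 5 tokens (e.g. "我 想 吃 酸菜 鱼"), 768 features each
Q = torch.randn(L, d)  # stand-ins for the projected query/key matrices
K = torch.randn(L, d)

scores = Q @ K.T       # (L, d) @ (d, L) -> (L, L)
print(scores.shape)    # torch.Size([5, 5])
# scores[0, 0] is the dot product of "我"'s query row with "我"'s key row,
# i.e. the (0, 0) raw attention score described above.
```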
- Next, we divide by the square root of d_k, which is 768 here.
Why do we divide by this value? The primary reason is to scale down the magnitude of the dot products, which keeps the softmax out of its saturated region and ensures the stability of the gradients during training.
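The scaling effect is easy to see numerically. A sketch (with random vectors, purely for illustration): dot products of 768-dimensional unit-variance vectors have a standard deviation of roughly sqrt(768) ≈ 27.7, and dividing by sqrt(d_k) brings it back to roughly 1:

```python
import torch

torch.manual_seed(0)
d_k = 768
# Sample many dot products of random unit-variance vectors.
q = torch.randn(10000, d_k)
k = torch.randn(10000, d_k)
raw = (q * k).sum(dim=-1)        # 10000 sample dot products
scaled = raw / d_k ** 0.5
print(raw.std())                 # roughly sqrt(768) ≈ 27.7
print(scaled.std())              # roughly 1.0
```

Without the scaling, scores this large would push softmax into a near one-hot regime where gradients vanish.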
Why use softmax? Applying the softmax function serves two main purposes:
- Non-negativity of attention weights: softmax ensures that all the attention weights are non-negative and sum to 1. This allows them to be interpreted as probabilities.
- Adding non-linearity: the softmax function introduces non-linearity, which helps the model learn more complex representations.
There have been studies and experiments on removing softmax from the attention mechanism, such as in the PaperWeekly exploration on linear attention: “Is Softmax Necessary for Attention?”. These works investigate the impact and necessity of the softmax function in attention mechanisms.
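The two softmax properties above can be verified directly. A minimal check with a hand-written score matrix:

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 0.5, 0.5]])
probs = F.softmax(scores, dim=-1)   # normalize across each row
print(probs.sum(dim=-1))            # each row sums to 1
print((probs >= 0).all())           # all weights are non-negative
```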
- Finally, we multiply the attention weights by the V matrix: (L, L) × (L, 768) = (L, 768).
First, let’s take the attention weights of the word “我” (meaning “I”) on every word in the sentence “我想吃酸菜鱼” (“I want to eat pickled fish”). These weights are then multiplied by the first dimension of the features in the V matrix for each word in the sentence. The sum of these products gives a weighted sum of the features, essentially combining the features of each word based on how much attention “我” gives to each word.
This process is then repeated for the second dimension of the features, and so on, for all dimensions. In the end, you obtain a result matrix with shape (L, 768), which matches the shape of the input matrix.
This means that the output of the self-attention mechanism preserves the original dimensionality of the input, while encoding the contextual information by focusing on different parts of the sentence based on the learned attention weights.
```python
import math
import torch
import torch.nn as nn

class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super(BertSelfAttention, self).__init__()
        self.all_head_size = config.hidden_size  # config.hidden_size 768, self.all_head_size 768
        self.attention_head_size = self.all_head_size  # 768 here (single-head simplification)
        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

    def forward(self, hidden_states):  # hidden_states dim (L, 768)
        Q = self.query(hidden_states)
        K = self.key(hidden_states)
        V = self.value(hidden_states)
        attention_scores = torch.matmul(Q, K.transpose(-1, -2))
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        attention_probs = nn.Softmax(dim=-1)(attention_scores)
        out = torch.matmul(attention_probs, V)
        return out
```
Why Is It Called a Self-Attention Network?

The reason is that the Q, K, and V matrices are all derived from the same input sentence. Following the process described above, it essentially means that the attention weights are calculated based on each word’s relationship to every other word (including itself) within the same sentence.
What If It’s Not Self-Attention?

In contrast, if it’s not self-attention, the query matrix Q might come from sentence A, while the key K and value V matrices come from sentence B. This would mean the attention weights are calculated based on the relationship between different sentences.
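A cross-attention variant can be sketched by changing only where K and V come from. The module name `CrossAttention` and its arguments are hypothetical, for illustration only:

```python
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Q comes from sequence A; K and V come from sequence B."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
        self.d_k = hidden_size

    def forward(self, a, b):                      # a: (La, 768), b: (Lb, 768)
        Q = self.query(a)                         # (La, 768)
        K, V = self.key(b), self.value(b)         # (Lb, 768)
        scores = Q @ K.transpose(-1, -2) / math.sqrt(self.d_k)  # (La, Lb)
        return torch.softmax(scores, dim=-1) @ V  # (La, 768)

a, b = torch.randn(4, 768), torch.randn(7, 768)
out = CrossAttention()(a, b)
print(out.shape)  # torch.Size([4, 768])
```

Note that the output has sequence A's length but mixes in sequence B's values; this is exactly the third sub-layer of the Transformer decoder.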
Positional Invariance in Attention Mechanisms

It’s important to note that if you swap any two words’ positions in the K and V matrices, the final result remains unaffected. This is because the attention mechanism inherently lacks positional information, unlike CNNs, RNNs, or LSTMs, which do consider the order of elements.

Why Positional Embedding Is Necessary

Because the attention mechanism does not inherently consider the position of words, positional embeddings are introduced. Positional embeddings provide information about the order of words, helping the model understand the sequence structure of the input data.
Position Embedding
However, with a bit of thought, it becomes apparent that such a model cannot capture the order of a sequence. In other words, if you shuffle the rows of the K and V matrices (equivalent to shuffling the word order in a sentence), the result of the attention mechanism remains the same.
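The claim above can be checked numerically: permuting the rows of K and V leaves every output row unchanged. A small sketch (dimensions shrunk for readability):

```python
import torch

torch.manual_seed(0)
L, d = 5, 16
Q, K, V = torch.randn(L, d), torch.randn(L, d), torch.randn(L, d)

def attend(Q, K, V):
    probs = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)
    return probs @ V

perm = torch.randperm(L)          # shuffle the "word order" of K and V
out_orig = attend(Q, K, V)
out_perm = attend(Q, K[perm], V[perm])
print(torch.allclose(out_orig, out_perm, atol=1e-5))  # True
```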
This observation reveals a significant limitation: the Attention model, up to this point, is at most a very sophisticated “bag of words” model. This is a serious problem because, for time series data—especially in NLP tasks—sequence order is crucial information. The order represents both local and global structures. If a model cannot learn the order information, its performance will be significantly diminished. For instance, in machine translation, the model might translate each word correctly, but fail to organize them into a coherent sentence.
To address this issue, Google introduced Position Embedding—also known as “position vectors.” This technique involves assigning a unique identifier to each position in the sequence, with each identifier corresponding to a vector. By combining position vectors with word vectors, the model introduces positional information for each word, enabling the attention mechanism to distinguish between words in different positions.
In RNN and CNN models, Position Embedding had been more of an auxiliary method—beneficial, but not essential. These models could already capture positional information to some extent. However, in a pure Attention model, Position Embedding becomes the sole source of positional information, making it a core component of the model, rather than just a helpful addition.
Traditionally, Position Embeddings were vectors trained specifically for the task at hand. Google, however, introduced a formula to construct Position Embeddings, providing a systematic way to encode positional information:

$$\text{PE}_{(pos,\,2i)}=\sin\left(\frac{pos}{10000^{2i/d_{model}}}\right),\qquad \text{PE}_{(pos,\,2i+1)}=\cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
In their paper, Google mentioned that they compared position vectors trained directly with those calculated using the aforementioned formula, and the results were similar. Given comparable results, the formula-based construction is preferred: it requires no training and can be computed for sequence lengths not seen during training.
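The sinusoidal formula can be implemented in a few lines. The helper name `sinusoidal_pe` is ours, not from any library:

```python
import torch

def sinusoidal_pe(max_len, d_model):
    """Build the (max_len, d_model) sinusoidal Position Embedding table."""
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
    # Even column indices 0, 2, 4, ... play the role of 2i in the formula.
    idx = torch.arange(0, d_model, 2, dtype=torch.float)
    div = torch.pow(10000.0, idx / d_model)                      # 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos / div)   # even dimensions get sin
    pe[:, 1::2] = torch.cos(pos / div)   # odd dimensions get cos
    return pe

pe = sinusoidal_pe(50, 768)
print(pe.shape)  # torch.Size([50, 768])
```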
Position Embedding inherently provides absolute positional information, but relative position is also crucial in language understanding. One important reason Google chose the specific formula for position vectors is based on trigonometric identities:
$$\sin(x+y) = \sin(x)\cos(y) + \cos(x)\sin(y)$$
$$\cos(x+y) = \cos(x)\cos(y) - \sin(x)\sin(y)$$

These identities suggest that the vector for position p+k can be represented as a linear transformation of the vector for position p, which opens up the possibility of expressing relative positional information.
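This linear-transformation property can be verified for a single sin/cos pair: the vector at position p+k equals a rotation of the vector at position p, where the rotation depends only on the offset k. A numeric sketch (the frequency choice i = 3 is arbitrary):

```python
import math
import torch

w = 1.0 / 10000 ** (2 * 3 / 768)   # frequency of one sin/cos pair (i = 3)
p, k = 10.0, 4.0

pe_p  = torch.tensor([math.sin(w * p), math.cos(w * p)])
pe_pk = torch.tensor([math.sin(w * (p + k)), math.cos(w * (p + k))])

# Rotation matrix that depends only on the offset k, not on p:
R = torch.tensor([[math.cos(w * k),  math.sin(w * k)],
                  [-math.sin(w * k), math.cos(w * k)]])
print(torch.allclose(R @ pe_p, pe_pk, atol=1e-6))  # True
```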
Intuitively, adding the vectors might seem to lead to information loss, which could make it seem less desirable. However, Google’s results demonstrated that adding the vectors is also an effective approach. This shows that while the concern about potential information loss might be valid in theory, in practice, the method of addition works well.
The word embedding and the position embedding are then added and dropout is applied to produce the final token embedding.
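That add-then-dropout step can be sketched as follows, here with learned position embeddings as used in BERT (the sizes and token ids are arbitrary illustrations):

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 1000, 128, 768
word_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(max_len, d_model)     # learned positions, as in BERT
dropout = nn.Dropout(p=0.1)

token_ids = torch.tensor([[5, 42, 7, 99]])                # (batch=1, L=4)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)  # [[0, 1, 2, 3]]
x = dropout(word_emb(token_ids) + pos_emb(positions))     # (1, 4, 768)
print(x.shape)  # torch.Size([1, 4, 768])
```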
In summary, self-attention is “self” because it considers relationships within the same input, and positional embeddings are used to compensate for the lack of inherent position-awareness in the attention mechanism.
Encoder and Decoder
The encoder consists of N identical blocks applied one after another. Each encoder block has two sub-layers: a self-attention layer followed by a position-wise fully-connected network. The block also incorporates layer normalization layers which are added before the sub-layers, and dropout layers added after the sub-layers. Finally, a residual connection is applied after both the self-attention and the fully-connected layers.

The decoder is also composed of N identical blocks. The decoder block is very similar to the encoder block, but with two differences:
- The decoder uses masked self-attention, meaning that current sequence elements cannot attend to future elements.
- In addition to the two sub-layers, the decoder uses a third sub-layer, which performs cross-attention (not self-attention) between the decoded sequence and the outputs of the encoder.
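The encoder block described above can be sketched with PyTorch's built-in `nn.MultiheadAttention`. This is a minimal pre-LN sketch, not the exact original implementation; the hyperparameters are typical defaults, not prescribed by the source:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """LayerNorm before each sub-layer, dropout after it, then a residual."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, p=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.drop = nn.Dropout(p)

    def forward(self, x):                       # x: (B, L, d_model)
        h = self.ln1(x)
        x = x + self.drop(self.attn(h, h, h, need_weights=False)[0])
        x = x + self.drop(self.ff(self.ln2(x)))
        return x

x = torch.randn(2, 10, 768)
print(EncoderBlock()(x).shape)  # torch.Size([2, 10, 768])
```

A decoder block would add a causal mask to the self-attention and the cross-attention sub-layer shown earlier.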
🧠 Multi-Head Attention
$$\text{MultiHead}\left(\textbf{Q}, \textbf{K}, \textbf{V}\right) = \left[\text{head}_{1},\dots,\text{head}_{h}\right]\textbf{W}_{O}$$
$$\text{where head}_{i} = \text{Attention}\left(\textbf{Q}\textbf{W}_{i}^{Q}, \textbf{K}\textbf{W}_{i}^{K}, \textbf{V}\textbf{W}_{i}^{V}\right)$$

Above, the W matrices are all learnable parameter matrices.
The Multi-Head Attention mechanism allows the model to jointly attend to information from different representation subspaces at different positions. This is achieved by performing attention on different representation subspaces in parallel, resulting in multiple attention distributions. The result is a concatenation of these distributions, which is then fed through a linear transformation layer to produce the output.
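The split–attend–concatenate–project pipeline above can be sketched from scratch. A minimal single-input (self-attention) version, assuming d_model = 768 split into 12 heads of 64 dimensions each:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # the output projection W_O

    def forward(self, x):                        # x: (B, L, d_model)
        B, L, _ = x.shape
        # Project, then split d_model into h heads of size d_k.
        def split(t):
            return t.view(B, L, self.h, self.d_k).transpose(1, 2)  # (B, h, L, d_k)
        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        scores = Q @ K.transpose(-1, -2) / math.sqrt(self.d_k)     # (B, h, L, L)
        heads = torch.softmax(scores, dim=-1) @ V                  # (B, h, L, d_k)
        concat = heads.transpose(1, 2).reshape(B, L, -1)           # (B, L, d_model)
        return self.W_o(concat)                                    # apply W_O

x = torch.randn(2, 10, 768)
print(MultiHeadAttention()(x).shape)  # torch.Size([2, 10, 768])
```

Each head attends with its own learned projections, so different heads can specialize in different relations; the final linear layer mixes them back into one representation.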